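This blog post is a hands-on web-scraping tutorial. The full code is included and can be copied and run. If you only want the working code, skip to the second half of the post.

The scraper is built on the selenium library, which handles pages by imitating what a user does in the browser. Keeping that model in mind makes the code much easier to follow.

For the basics of selenium, and for the steps to find the XPath of an element on a page, see my earlier login tutorial: Python爬虫实战（保姆级自动登录教程） (Python Scraping in Practice: A Step-by-Step Auto-Login Tutorial).

Introduction:

This post implements the whole flow: log in → search → set options → scrape.

The example site is the 中塑行情 quotes site, at

https://quote.21cp.com/

You may wonder why I wrote this program. Because doing it by hand is too tedious, of course. As the scraping plan below shows, for some steps a scraper is really just simulated manual clicking.

Before writing a scraper, get the plan straight: work out which steps are needed and in what order, wrap each step in a function, and then write and debug the code one step at a time.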

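1. Scraping plan

At work, I need to scrape the historical prices of a number of products from the 中塑行情 site.

Scraping each product requires four pieces of information: the product name (产品名称), the keyword (关键字), the start date (起始时间), and the end date (终止时间). The data is only visible after logging in, so a username and password are needed as well.

These fields come from an .xlsx spreadsheet that holds the lookup information for every product to be searched.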
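As a rough illustration of the four fields, one spreadsheet row could be modelled as a small record type. This is only a sketch: the class name Project, its attribute names, the two sanity checks, and the example dates are assumptions made for illustration, not part of the original program, which passes plain dicts around.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Project:
    """One spreadsheet row: what to search for and over which date range."""
    product_name: str   # 产品名称, e.g. "镇海"
    keyword: str        # 关键字, e.g. "548r"
    start: date         # 起始时间
    end: date           # 终止时间

    def __post_init__(self):
        # Hypothetical sanity checks; the original code does not validate rows.
        if not self.keyword.strip():
            raise ValueError("keyword must not be empty")
        if self.start > self.end:
            raise ValueError("start date must not be after end date")

    def as_row(self) -> dict:
        """Return the dict shape used by the code later in this post."""
        return {
            "产品名称": self.product_name,
            "关键字": self.keyword,
            "起始时间": self.start.isoformat(),
            "终止时间": self.end.isoformat(),
        }


# Example dates are made up for the demonstration.
example = Project("镇海", "548r", date(2023, 1, 1), date(2023, 12, 31))
row = example.as_row()
```

Validating a row before the browser is involved means a bad date range is reported immediately instead of surfacing later as an empty result page.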

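To work out the scraping plan, start by collecting the target information by hand.

First, open the home page of the quotes site. Clicking 石化出厂价 (petrochemical ex-factory prices) at this point shows no data, because we are not logged in. So logging in is a prerequisite for scraping.

Click the login link to go to the login page. (Later, the code jumps straight to the login page URL, but clicking through from the home page with selenium works too.)

I recommend logging in with the QR code (the alternative is account and password plus a CAPTCHA). QR login means installing the app on your phone beforehand and scanning the code. It is more convenient, because you do not have to retype the account, password, and CAPTCHA on every debugging run.

After logging in, following the company's requirements, click the 石化出厂价 button.

Every product search starts from this page, so we can also note its URL and jump to it directly. Clicking 石化出厂价 from the home page with selenium works as well.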

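Take the first product as an example.

First, type the keyword 548r into the search box and click the search button.

Next, find the row whose product name contains "镇海", the product name given in the .xlsx spreadsheet, and click 历史价格 (historical prices).

The page shows a line chart, but what we want is the data, so click the button marked with a red box in the original screenshot.

Then enter the start and end dates from the spreadsheet into the date range fields and click query. (Delete the pre-filled default dates before typing; this is a mistake I made myself.)

Scrolling to the bottom shows that the results span more than one page. The scraper has to keep clicking the next-page button until that button can no longer be clicked.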
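The "keep clicking next until it is disabled" rule can be sketched independently of the browser. In the sketch below, collect_all_pages, its three callback parameters, the max_pages cap, and the FakePager stand-in are all hypothetical names introduced for illustration; in a real run the callbacks would wrap selenium calls.

```python
from typing import Callable, List


def collect_all_pages(
    read_rows: Callable[[], List[list]],
    next_is_disabled: Callable[[], bool],
    click_next: Callable[[], None],
    max_pages: int = 1000,
) -> List[list]:
    """Read the current page, then advance until the next button is disabled."""
    data: List[list] = []
    for _ in range(max_pages):      # hard cap guards against an endless loop
        data.extend(read_rows())
        if next_is_disabled():      # last page reached
            break
        click_next()
    return data


class FakePager:
    """Stand-in for a paginated table, used only to exercise the loop."""

    def __init__(self, pages):
        self.pages = pages
        self.index = 0

    def read_rows(self):
        return self.pages[self.index]

    def next_is_disabled(self):
        return self.index == len(self.pages) - 1

    def click_next(self):
        self.index += 1


pager = FakePager([[["row 1"]], [["row 2"]], [["row 3"]]])
all_rows = collect_all_pages(pager.read_rows, pager.next_is_disabled, pager.click_next)
```

Reading the page before checking the button matters: the last page still has rows, so the check has to come after the read.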

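Doing this by hand means searching for the products one by one, entering the dates for each, clicking the export button, and then downloading the spreadsheet from the personal profile page. That is far too much work. Inspecting the page, however, shows a simpler route.

Right-click the page and choose 检查 (Inspect), then click the icon shown in the original screenshot (most likely the element-picker arrow at the top left of the developer tools).

All of the data in the table is present in the page, so it can be scraped directly. The same panel also gives you the XPath of any page element. (For a step-by-step guide to getting an XPath, see my earlier login tutorial, Python爬虫实战（保姆级自动登录教程）.)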
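To make the idea of an XPath concrete, here is a small sketch that uses Python's built-in xml.etree.ElementTree on a made-up page fragment. The fragment is hypothetical and far simpler than the real page, and ElementTree supports only a subset of XPath, whereas selenium evaluates full XPath in the browser. The point is only to contrast a position-based path with an attribute-based one.

```python
import xml.etree.ElementTree as ET

# Hypothetical, simplified page fragment; the real page is far larger.
PAGE = """
<html><body>
  <div>
    <form>
      <div><input id="startDate" value="2024-01-01"/></div>
      <div><input id="endDate" value="2024-01-31"/></div>
    </form>
  </div>
</body></html>
"""

root = ET.fromstring(PAGE.strip())

# Position-based path: depends on the exact nesting of every ancestor.
by_position = root.find("./body/div/form/div[1]/input")

# Attribute-based path: finds the same node wherever it sits in the tree.
by_id = root.find(".//input[@id='startDate']")

same_node = by_position is by_id
start_value = by_id.get("value")
```

Paths copied from the developer tools are often of the first kind and break when the site changes its layout; where an element has an id, the second kind tends to be more stable.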

That completes one product. To search for the next one, click 石化出厂价 in the left-hand menu to return to the search page.

2. Writing the code

__init__: the class constructor, which initializes the instance variables.

def __init__(self):
    self.status = 0         # progress marker: how far the run has got
    self.login_method = 1   # {0: simulated login, 1: Cookie login}; pick one
    chrome_options = Options()
    chrome_options.add_argument("start-maximized")  # maximize the window on startup
    self.driver = webdriver.Chrome(options=chrome_options)

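- self.status: tracks the current state of the program.
- self.login_method: selects the login method; 0 means simulated login, 1 means Cookie login.
- chrome_options: Chrome browser options, such as maximizing the window on startup.
- self.driver: the Chrome WebDriver instance.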

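set_cookie: logs in by QR code, then reads the cookies and saves them to a local file.

def set_cookie(self):
    self.driver.get(login_url)
    sleep(2)  # wait for the page to load
    # wait until the login page is actually showing
    while self.driver.title.find('账号登录-中塑在线-21世纪塑料行业门户') == -1:
        sleep(1)
    print('###Please scan the QR code to log in###')
    # maximize the browser window
    self.driver.maximize_window()
    sleep(2)  # wait for the page to load
    # the title changes once the scan succeeds
    while self.driver.title == '账号登录-中塑在线-21世纪塑料行业门户':
        sleep(1)
    print("###Scan succeeded###")
    cookies = self.driver.get_cookies()
    with open("cookies.pkl", "wb") as f:
        pickle.dump(cookies, f)
    print("###Cookies saved###")
    self.driver.get(target_url)

It opens the login page, waits for it to load, and prompts the user to scan the QR code.
It maximizes the browser window and, once the scan succeeds, reads the cookies.
It saves the cookies to the local file cookies.pkl and then opens the target page.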
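The two while loops in set_cookie wait forever if the scan never happens. A polling helper with a timeout is one possible alternative; the sketch below is an assumption of how that could look, and wait_until, its parameters, and the default timeout are hypothetical names and values, not part of the original program.

```python
import time
from typing import Callable


def wait_until(predicate: Callable[[], bool], timeout: float = 120.0,
               interval: float = 1.0) -> bool:
    """Poll predicate until it returns True or timeout seconds have passed."""
    deadline = time.monotonic() + timeout
    while True:
        if predicate():
            return True
        if time.monotonic() >= deadline:
            return False            # caller decides how to handle the timeout
        time.sleep(interval)


# Dry run with a counter standing in for "the page title has changed".
calls = {"n": 0}


def title_changed() -> bool:
    calls["n"] += 1
    return calls["n"] >= 3


scanned = wait_until(title_changed, timeout=5, interval=0.01)
gave_up = wait_until(lambda: False, timeout=0.05, interval=0.01)
```

With a real driver, the predicate might be something like lambda: driver.title != login_title, where login_title is the title string that set_cookie compares against.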

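get_cookie: loads the cookies from the local file and applies them to the current browser session. It tries to read cookies.pkl and adds each cookie to the current WebDriver session. Its code appears only in the full listing in section 4, and the login method described next does not call it.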
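A minimal sketch of the save, load, and apply round trip follows. The helper names save_cookies, load_cookies, and apply_cookies, and the FakeDriver class, are hypothetical and exist only so the sketch runs without a browser. The one selenium call relied on is driver.add_cookie(cookie_dict); note that selenium only accepts a cookie when the browser is already on a page of the cookie's domain, so the site has to be opened before the cookies are applied.

```python
import os
import pickle
import tempfile


def save_cookies(cookies, path):
    """Write a list of cookie dicts to path."""
    with open(path, "wb") as f:
        pickle.dump(cookies, f)


def load_cookies(path):
    """Return the saved cookie list, or [] when no file exists yet."""
    if not os.path.exists(path):
        return []
    with open(path, "rb") as f:     # only unpickle files you created yourself
        return pickle.load(f)


def apply_cookies(driver, cookies):
    """Add each cookie to the session; the driver must already be on the site."""
    for cookie in cookies:
        driver.add_cookie(cookie)
    return len(cookies)


class FakeDriver:
    """Minimal stand-in exposing only add_cookie, for a dry run."""

    def __init__(self):
        self.jar = []

    def add_cookie(self, cookie):
        self.jar.append(cookie)


with tempfile.TemporaryDirectory() as tmp:
    cookie_path = os.path.join(tmp, "cookies.pkl")
    save_cookies([{"name": "session", "value": "abc"}], cookie_path)
    fake = FakeDriver()
    applied = apply_cookies(fake, load_cookies(cookie_path))
    missing = load_cookies(os.path.join(tmp, "no_such_file.pkl"))
```

Returning an empty list for a missing file lets the caller fall back to the QR-code login without a try/except around the load.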

login: chooses the login method according to the value of self.login_method.

def login(self):
    if self.login_method == 0:
        self.driver.get(login_url)
        print('###Starting login###')
    elif self.login_method == 1:
        self.set_cookie()

For simulated login (self.login_method == 0), it opens the login page directly.
For Cookie login (self.login_method == 1), it calls the set_cookie method.

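enter_concert: opens the browser, goes to the 中塑在线 site, and logs in.

def enter_concert(self):
    """Open the browser and log in."""
    print('###Opening the browser and going to the 中塑在线 site###')
    self.login()  # log in first
    self.driver.refresh()  # refresh the page
    self.status = 2  # marks a successful login
    print("###Login succeeded###")
    if self.isElementExist('/html/body/div[4]/div[1]/div[2]/a[2]'):
        self.driver.find_element(By.XPATH, '/html/body/div[4]/div[1]/div[2]/a[2]').click()

It calls the login method to log in.
It refreshes the page and sets the status to logged in.
If a specific element is present on the page, it clicks that element.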

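isElementExist: checks whether a given element exists on the page.

# requires: from selenium.common.exceptions import NoSuchElementException
def isElementExist(self, element):
    try:
        self.driver.find_element(By.XPATH, element)
        return True
    except NoSuchElementException:
        return False

It tries to find the element by XPath and returns True if it is found, False otherwise.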
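An existence check can also be written without exception handling, because selenium's find_elements returns an empty list when nothing matches. The sketch below assumes that approach; element_exists and FakeDriver are hypothetical names, and the fake driver exists only so the function can be exercised without a browser.

```python
def element_exists(driver, xpath: str) -> bool:
    """Return True when at least one element matches xpath."""
    # "xpath" is the string value behind selenium's By.XPATH constant.
    return len(driver.find_elements("xpath", xpath)) > 0


class FakeDriver:
    """Stand-in that knows a fixed set of XPaths, for a dry run."""

    def __init__(self, known):
        self.known = set(known)

    def find_elements(self, by, value):
        return ["element"] if value in self.known else []


fake = FakeDriver(['//*[@id="startDate"]'])
found = element_exists(fake, '//*[@id="startDate"]')
not_found = element_exists(fake, '//*[@id="missing"]')
```

Either form only reports whether the element is present at that instant; an element that appears a moment later still needs an explicit wait.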

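read_key: reads the keywords and the other fields from an Excel file named by the user.

# read the keyword file
def read_key(self):
    file_path = input("Enter the file name of the keyword spreadsheet: ")
    try:
        # read the Excel file with pandas
        df = pd.read_excel(file_path)

        # convert the start and end dates to yyyy-mm-dd strings
        df['起始时间'] = pd.to_datetime(df['起始时间']).dt.strftime('%Y-%m-%d')
        df['终止时间'] = pd.to_datetime(df['终止时间']).dt.strftime('%Y-%m-%d')

        # store each project as a dict, collected in a list
        projects = df.apply(lambda row: {
            '产品名称': row['产品名称'],
            '关键字': row['关键字'],
            '起始时间': row['起始时间'],
            '终止时间': row['终止时间']
        }, axis=1).tolist()

        return projects
    except Exception as e:
        print(f"An error occurred: {e}")
        return []

It prompts for the file name and reads the Excel file.
It converts the date columns to yyyy-mm-dd strings.
It stores each project as a dict and returns a list of all projects. If anything goes wrong, it prints the error and returns an empty list.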
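The date conversion step can be illustrated with the standard library alone. The sketch below is not what read_key does (it uses pandas); normalize_date and the list of accepted formats are assumptions chosen for illustration, showing how differently written dates end up as the same yyyy-mm-dd string.

```python
from datetime import date, datetime

# Hypothetical list of input formats a spreadsheet might contain.
FORMATS = ("%Y-%m-%d", "%Y/%m/%d", "%Y.%m.%d", "%Y%m%d")


def normalize_date(value) -> str:
    """Return value as a 'YYYY-MM-DD' string, or raise ValueError."""
    if isinstance(value, datetime):     # check datetime first: it subclasses date
        return value.date().isoformat()
    if isinstance(value, date):
        return value.isoformat()
    text = str(value).strip()
    for fmt in FORMATS:
        try:
            return datetime.strptime(text, fmt).date().isoformat()
        except ValueError:
            continue                    # try the next format
    raise ValueError(f"unrecognized date: {value!r}")


from_slashes = normalize_date("2023/1/5")
from_compact = normalize_date("20230105")
from_datetime = normalize_date(datetime(2023, 1, 5, 8, 30))
```

Failing loudly on an unrecognized date is deliberate here: a silently wrong date would only show up later as a query that returns no rows.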

search_item: searches the site for a product by keyword.

def search_item(self, item):
    print(f"Current keyword: {item}")
    self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[1]/div[2]/div[2]/div[9]/a').click()
    sleep(2)  # wait for the page to load
    # type the keyword into the search box
    self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[2]/div[4]/div/form/div[1]/input').send_keys(item)
    # click the search button
    self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[2]/div[4]/div/form/div[2]').click()
    sleep(2)  # wait for the page to load

It prints the current keyword.
It clicks through to the search page, types the keyword, and runs the search.

get_html: finds the 历史价格 (historical prices) link for the row that contains the given product name.

def get_html(self, project):
    # search results URL for this keyword
    item = project['关键字']
    url = f'https://quote.21cp.com/home_centre/list/--.html?salesDivisionAreaSidList=&keyword={item}&quotedPriceDateRef=&isList=1'
    history_price_link = None
    try:
        # open the page
        self.driver.get(url)
        while True:
            # wait for the table element to load
            table = WebDriverWait(self.driver, 10).until(
                EC.presence_of_element_located((By.TAG_NAME, "table"))
            )

            # locate all rows of the table
            rows = table.find_elements(By.TAG_NAME, 'tr')
            # pick out the 历史价格 link of rows with a matching product name
            for row in rows:
                # locate the cell that holds the product name
                product_name_cells = row.find_elements(By.XPATH, ".//td[1]/a")
                for cell in product_name_cells:
                    if project['产品名称'] in cell.text:
                        # explicit wait until the link is present in this row
                        try:
                            first_link = WebDriverWait(row, 5).until(
                                EC.presence_of_element_located((By.XPATH, ".//a[contains(., '历史价格')]"))
                            )
                            history_price_link = first_link.get_attribute('href')
                        except Exception:
                            # no link in this row; move on
                            continue
            try:
                # get the next-page button
                next_page = WebDriverWait(self.driver, 10).until(
                    EC.presence_of_element_located((By.XPATH, "//*[@id='page']/div/a[2]"))
                )

                # check whether the next-page button is disabled
                if "layui-disabled" in next_page.get_attribute("class"):
                    break  # last page: leave the loop
                # otherwise click it
                self.driver.execute_script("arguments[0].click();", next_page)
            except Exception:
                return history_price_link

        return history_price_link
    except Exception as e:
        print("An error occurred:", e)
        return None

It builds a URL that contains the keyword and opens the page.
It waits for the table to load and, page by page, picks out the 历史价格 link of the row whose product name matches. If no row matches, it returns an empty value, which the caller treats as "product not found".
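get_html places the keyword into the URL with an f-string, which works for a keyword like 548r but not for one containing spaces, "&", or Chinese characters unless it is escaped. The sketch below shows one way to build the same query string with urllib.parse.urlencode. The function name build_search_url and the constant BASE are hypothetical; the parameter names are copied from the URL in the code above.

```python
from urllib.parse import urlencode

BASE = "https://quote.21cp.com/home_centre/list/--.html"


def build_search_url(keyword: str) -> str:
    """Return the search results URL for keyword, with the value escaped."""
    params = {
        "salesDivisionAreaSidList": "",
        "keyword": keyword,
        "quotedPriceDateRef": "",
        "isList": 1,
    }
    return f"{BASE}?{urlencode(params)}"


plain = build_search_url("548r")
escaped = build_search_url("a b&c")
```

For 548r the result is identical to the hand-written URL; for "a b&c" the value becomes a+b%26c, so the "&" can no longer be mistaken for a parameter separator.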

set_time: sets the start and end dates on the page.

def set_time(self, project):
    start_date = project['起始时间']
    end_date = project['终止时间']
    # enter the start date
    start_date_input = self.driver.find_element(By.XPATH, '//*[@id="startDate"]')
    start_date_input.clear()
    start_date_input.send_keys(start_date)
    # enter the end date
    end_date_input = self.driver.find_element(By.XPATH, '//*[@id="endDate"]')
    end_date_input.clear()
    end_date_input.send_keys(end_date)
    # click the query button if it is present
    if self.isElementExist('/html/body/div[1]/div[2]/div[2]/div[2]/div[2]/div[1]/form/div[3]'):
        self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[2]/div[2]/div[1]/form/div[3]').click()

It finds the two date input fields, clears their existing content, types in the project's start and end dates, and then clicks the query button if it is present.
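Why the clear() call matters can be shown with a stand-in for an input field. In the sketch below, replace_text and FakeInput are hypothetical; the fake mimics the relevant behaviour of a selenium element, where send_keys types after whatever text is already in the field.

```python
def replace_text(field, text: str) -> None:
    """Empty an input field, then type the new text."""
    field.clear()           # drop the pre-filled default first
    field.send_keys(text)


class FakeInput:
    """Stand-in for an input element: send_keys appends, clear empties."""

    def __init__(self, value=""):
        self.value = value

    def clear(self):
        self.value = ""

    def send_keys(self, text):
        self.value += text


box = FakeInput("2024-01-01")       # field with a pre-filled default date
box.send_keys("2023-05-01")         # typing without clearing first
appended = box.value

replace_text(box, "2023-05-01")     # clearing first gives the intended value
replaced = box.value
```

Without the clear step the field ends up holding both dates run together, which is the mistake mentioned in the scraping plan.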

get_data_simple: scrapes the data for one project and saves it as an Excel file.

def get_data_simple(self, project):
    link = self.get_html(project)
    if link:
        self.driver.get(link)
        sleep(2)
        self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[2]/div[2]/div[1]/form/div[1]/div[1]').click()
        sleep(2)
        data = []
        # read the title at the top left of the page
        page_title = self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[2]/div[1]/div[1]/div[1]').text
        file_name = page_title.replace(' ', '_').replace('/', '_').replace('\\', '_') + '.xlsx'
        # output folder 爬取文件 under the current working directory
        item_folder_path = os.path.join(os.getcwd(), '爬取文件')

        # create the output folder if it does not exist
        if not os.path.exists(item_folder_path):
            os.makedirs(item_folder_path)
        print(f"Current project: {page_title}")

        self.set_time(project)
        while True:
            try:
                # wait for the table element to load
                table = WebDriverWait(self.driver, 10).until(
                    EC.presence_of_element_located((By.XPATH, "/html/body/div[1]/div[2]/div[2]/div[2]/div[2]/table"))
                )
                # locate all rows of the table
                rows = table.find_elements(By.TAG_NAME, 'tr')
                # extract the table data
                for row in rows[1:-1]:  # skip the header row and the last row
                    cols = row.find_elements(By.TAG_NAME, 'td')
                    cols_data = [col.text for col in cols]
                    data.append(cols_data)
                # get the next-page button
                next_page = WebDriverWait(self.driver, 10).until(
                    EC.presence_of_element_located((By.XPATH, "//*[@id='page']/div/a[3]"))
                )
                # check whether the next-page button is disabled
                if "layui-disabled" in next_page.get_attribute("class"):
                    break  # last page: leave the loop
                # otherwise click it
                self.driver.execute_script("arguments[0].click();", next_page)
            except TimeoutException:
                print('No data for this project, or the dates are out of range; writing an empty file')
                print("------------------------------------------")
                # no next-page button within the timeout: treat as the last page
                break

        # full path of the output file
        file_path = os.path.join(item_folder_path, file_name)
        # convert the data to a DataFrame
        df = pd.DataFrame(data, columns=['更新日期', '价格(元/吨)', '涨跌', '涨跌值', '备注'])
        # write to an Excel file
        df.to_excel(file_path, index=False, engine='openpyxl')
        print(f"Data written to {file_path}")
        print("------------------------------------------")
    else:
        print("Project not found. Project details:")
        print(project)
        # append the project to a separate xlsx file of projects that were not found
        self.save_project_info_to_excel(project)

It gets the 历史价格 link, opens it, and sets the date range.
It scrapes the table page by page and saves the rows to an Excel file named after the page title.
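The file name is derived from the page title by replacing spaces and slashes. A slightly broader variant is sketched below, assuming the output may land on Windows, where the characters \ / : * ? " < > | are not allowed in file names. The function name safe_filename and the example titles are hypothetical, and unlike the code above this version collapses a run of bad characters into a single underscore.

```python
import re

# Characters that are illegal in Windows file names, plus whitespace.
_ILLEGAL = re.compile(r'[\\/:*?"<>|\s]+')


def safe_filename(title: str, suffix: str = ".xlsx") -> str:
    """Turn a page title into a file name that is safe on common systems."""
    name = _ILLEGAL.sub("_", title).strip("_")
    return (name or "untitled") + suffix    # never return a bare suffix


from_title = safe_filename("PP 548R/镇海炼化")
from_symbols = safe_filename("a:b*c?")
from_blank = safe_filename("   ")
```

The fallback name covers the case where the title element is empty, which would otherwise produce a file called just ".xlsx".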

save_project_info_to_excel: if a project is not found, saves its details to an Excel file.

def save_project_info_to_excel(self, project):
    # file name and path
    file_name = '未找到项目信息.xlsx'
    item_folder_path = os.path.join(os.getcwd(), '爬取文件')
    if not os.path.exists(item_folder_path):
        os.makedirs(item_folder_path)
    file_path = os.path.join(item_folder_path, file_name)
    # start from an empty DataFrame if the file does not exist yet
    if not os.path.exists(file_path):
        df = pd.DataFrame(columns=['产品名称', '关键字'])
    else:
        # otherwise read the existing content
        df = pd.read_excel(file_path)
    # append the new project to the DataFrame
    new_project_info = pd.DataFrame([project])
    df = pd.concat([df, new_project_info], ignore_index=True)
    # write to the Excel file
    df.to_excel(file_path, index=False, engine='openpyxl')
    print(f"Project details appended to {file_path}")
    print("------------------------------------------")

It creates or reads 未找到项目信息.xlsx and appends the project's details to it.
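The append-or-create pattern can be tried out without pandas. The sketch below deliberately swaps Excel for CSV so that it runs on the standard library alone; it is not a drop-in replacement for the method above. The names append_missing, read_missing, and FIELDS are hypothetical, and the example dates are made up.

```python
import csv
import os
import tempfile

FIELDS = ["产品名称", "关键字", "起始时间", "终止时间"]


def append_missing(path: str, project: dict) -> None:
    """Append one project to a CSV log, writing the header on first use."""
    is_new = not os.path.exists(path)
    with open(path, "a", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=FIELDS, extrasaction="ignore")
        if is_new:
            writer.writeheader()
        writer.writerow(project)


def read_missing(path: str) -> list:
    """Read the log back as a list of dicts."""
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


with tempfile.TemporaryDirectory() as tmp:
    log_path = os.path.join(tmp, "missing.csv")
    append_missing(log_path, {"产品名称": "镇海", "关键字": "548r",
                              "起始时间": "2023-01-01", "终止时间": "2023-12-31"})
    append_missing(log_path, {"产品名称": "示例", "关键字": "abc",
                              "起始时间": "2023-02-01", "终止时间": "2023-02-28"})
    logged = read_missing(log_path)
```

Opening the file in append mode avoids re-reading and re-writing the whole log for every missing project, which is what the read, concat, and write sequence above does.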

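get_data: loops over the project list and calls get_data_simple on each project.

def get_data(self, projects):
    for project in projects:
        try:
            self.get_data_simple(project)
        except Exception as e:
            print(f"An error occurred while processing project {project}: {e}")
    print('Export finished')

It catches the exception raised by any single project, so the remaining projects still run, and prints a completion message once all projects have been processed.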
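The same per-project isolation can be written so that failures are collected instead of only printed, which makes it easy to retry them afterwards. This is a sketch under that assumption; run_all and demo_handler are hypothetical names, and the handler here merely simulates one failing project.

```python
from typing import Callable, Iterable, List, Tuple


def run_all(items: Iterable, handler: Callable) -> Tuple[int, List[tuple]]:
    """Call handler on every item; return (success count, list of failures)."""
    done = 0
    failures = []
    for item in items:
        try:
            handler(item)
            done += 1
        except Exception as exc:    # keep going; record what failed and why
            failures.append((item, repr(exc)))
    return done, failures


def demo_handler(item):
    """Simulated work: the item 'bad' always fails."""
    if item == "bad":
        raise ValueError("no data")


done, failures = run_all(["a", "bad", "b"], demo_handler)
```

With the real program, handler would be con.get_data_simple and items the list returned by read_key.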

Main function

if __name__ == '__main__':
    try:
        con = Concert()
        con.enter_concert()
        projects = con.read_key()
        con.get_data(projects)
    except Exception as e:
        print(e)

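3. Execution steps

Initialize the WebDriver:
- Instantiate the Concert class, which creates the Chrome WebDriver and sets the browser to start maximized.
Log in to the site:
- The login method is chosen by the value of self.login_method.
- For Cookie login (self.login_method == 1), the set_cookie method runs:
  - It opens the login page, waits for it to load, and prompts the user to scan the QR code.
  - After a successful scan, it reads the cookies and saves them to the local file cookies.pkl.
  - It then opens the target URL.
- For simulated login (self.login_method == 0), the login page is opened directly.
Enter the target site:
- The enter_concert method runs:
  - It opens the browser and goes to the 中塑在线 site.
  - After logging in, it refreshes the page and sets the status to logged in.
  - If a specific element is present on the page, it clicks that element.
Read the keyword file:
- The read_key method runs:
  - It prompts the user for the file name of the keyword spreadsheet.
  - It reads the Excel file with pandas and converts the date columns to yyyy-mm-dd strings.
  - It stores each project as a dict in a list.
Search for a product:
- The search_item method searches by clicking through the page:
  - It opens the search page, types the keyword, and runs the search.
  - It waits for the results page to load.
- In the code listed here, the main flow does not call search_item; get_html opens the search results URL directly instead.
Get the historical price link:
- For each project, the get_html method runs:
  - It builds a URL that contains the keyword and opens the page.
  - It picks out the 历史价格 link of the row whose product name matches.
Set the date range:
- The set_time method runs:
  - It sets the start and end dates on the page.
Scrape the data and save it as an Excel file:
- The get_data_simple method runs:
  - It scrapes the data for the project.
  - If the project is found, the data is saved to an Excel file.
  - If the project is not found, it calls save_project_info_to_excel to record the project's details in an Excel file.
Loop over the project list:
- The get_data method runs:
  - It loops over the project list and calls get_data_simple on each project.
  - It handles exceptions and prints a completion message once all projects are done.
Main program:
- Create an instance of the Concert class.
- Log in and read the keyword file.
- Scrape the data.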

4. Full code

import os
import pickle
from time import sleep

import pandas as pd
from selenium import webdriver
from selenium.common.exceptions import NoSuchElementException, TimeoutException
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait

# home page of the quotes site
hangqing_url = "https://quote.21cp.com/"
# login page
login_url = "https://passport.21cp.com/auth/realms/zs-web/protocol/openid-connect/auth?response_type=code&client_id=zs-price-quote-web&redirect_uri=https%3A%2F%2Fquote.21cp.com%2Fsso%2Flogin&state=4e8a01e9-170f-4dda-bc27-8b43f9255aa2&login=true&scope=openid"
# 石化出厂价 target page
target_url = 'https://quote.21cp.com/home_centre/list/--.html'


class Concert:
    def __init__(self):
        self.status = 0         # progress marker: how far the run has got
        self.login_method = 1   # {0: simulated login, 1: Cookie login}; pick one
        chrome_options = Options()
        chrome_options.add_argument("start-maximized")  # maximize the window on startup
        self.driver = webdriver.Chrome(options=chrome_options)

    def set_cookie(self):
        self.driver.get(login_url)
        sleep(2)  # wait for the page to load
        # wait until the login page is actually showing
        while self.driver.title.find('账号登录-中塑在线-21世纪塑料行业门户') == -1:
            sleep(1)
        print('###Please scan the QR code to log in###')
        # maximize the browser window
        self.driver.maximize_window()
        sleep(2)  # wait for the page to load
        # the title changes once the scan succeeds
        while self.driver.title == '账号登录-中塑在线-21世纪塑料行业门户':
            sleep(1)
        print("###Scan succeeded###")
        cookies = self.driver.get_cookies()
        with open("cookies.pkl", "wb") as f:
            pickle.dump(cookies, f)
        print("###Cookies saved###")
        self.driver.get(target_url)

    def get_cookie(self):
        try:
            # load the saved cookies
            with open("cookies.pkl", "rb") as f:
                cookies = pickle.load(f)
            for cookie in cookies:
                self.driver.add_cookie(cookie)
            print('###Cookies loaded###')
        except Exception as e:
            print(e)

    def login(self):
        if self.login_method == 0:
            self.driver.get(login_url)
            print('###Starting login###')
        elif self.login_method == 1:
            self.set_cookie()

    def enter_concert(self):
        """Open the browser and log in."""
        print('###Opening the browser and going to the 中塑在线 site###')
        self.login()  # log in first
        self.driver.refresh()  # refresh the page
        self.status = 2  # marks a successful login
        print("###Login succeeded###")
        if self.isElementExist('/html/body/div[4]/div[1]/div[2]/a[2]'):
            self.driver.find_element(By.XPATH, '/html/body/div[4]/div[1]/div[2]/a[2]').click()

    def isElementExist(self, element):
        try:
            self.driver.find_element(By.XPATH, element)
            return True
        except NoSuchElementException:
            return False

    # read the keyword file
    def read_key(self):
        file_path = input("Enter the file name of the keyword spreadsheet: ")
        try:
            # read the Excel file with pandas
            df = pd.read_excel(file_path)

            # convert the start and end dates to yyyy-mm-dd strings
            df['起始时间'] = pd.to_datetime(df['起始时间']).dt.strftime('%Y-%m-%d')
            df['终止时间'] = pd.to_datetime(df['终止时间']).dt.strftime('%Y-%m-%d')

            # store each project as a dict, collected in a list
            projects = df.apply(lambda row: {
                '产品名称': row['产品名称'],
                '关键字': row['关键字'],
                '起始时间': row['起始时间'],
                '终止时间': row['终止时间']
            }, axis=1).tolist()

            return projects
        except Exception as e:
            print(f"An error occurred: {e}")
            return []

    def search_item(self, item):
        print(f"Current keyword: {item}")
        self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[1]/div[2]/div[2]/div[9]/a').click()
        sleep(2)  # wait for the page to load
        # type the keyword into the search box
        self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[2]/div[4]/div/form/div[1]/input').send_keys(item)
        # click the search button
        self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[2]/div[4]/div/form/div[2]').click()
        sleep(2)  # wait for the page to load

    def get_html(self, project):
        # search results URL for this keyword
        item = project['关键字']
        url = f'https://quote.21cp.com/home_centre/list/--.html?salesDivisionAreaSidList=&keyword={item}&quotedPriceDateRef=&isList=1'
        history_price_link = None
        try:
            # open the page
            self.driver.get(url)
            while True:
                # wait for the table element to load
                table = WebDriverWait(self.driver, 10).until(
                    EC.presence_of_element_located((By.TAG_NAME, "table"))
                )

                # locate all rows of the table
                rows = table.find_elements(By.TAG_NAME, 'tr')
                # pick out the 历史价格 link of rows with a matching product name
                for row in rows:
                    # locate the cell that holds the product name
                    product_name_cells = row.find_elements(By.XPATH, ".//td[1]/a")
                    for cell in product_name_cells:
                        if project['产品名称'] in cell.text:
                            # explicit wait until the link is present in this row
                            try:
                                first_link = WebDriverWait(row, 5).until(
                                    EC.presence_of_element_located((By.XPATH, ".//a[contains(., '历史价格')]"))
                                )
                                history_price_link = first_link.get_attribute('href')
                            except Exception:
                                # no link in this row; move on
                                continue
                try:
                    # get the next-page button
                    next_page = WebDriverWait(self.driver, 10).until(
                        EC.presence_of_element_located((By.XPATH, "//*[@id='page']/div/a[2]"))
                    )

                    # check whether the next-page button is disabled
                    if "layui-disabled" in next_page.get_attribute("class"):
                        break  # last page: leave the loop
                    # otherwise click it
                    self.driver.execute_script("arguments[0].click();", next_page)
                except Exception:
                    return history_price_link

            return history_price_link
        except Exception as e:
            print("An error occurred:", e)
            return None

    def set_time(self, project):
        start_date = project['起始时间']
        end_date = project['终止时间']
        # enter the start date
        start_date_input = self.driver.find_element(By.XPATH, '//*[@id="startDate"]')
        start_date_input.clear()
        start_date_input.send_keys(start_date)
        # enter the end date
        end_date_input = self.driver.find_element(By.XPATH, '//*[@id="endDate"]')
        end_date_input.clear()
        end_date_input.send_keys(end_date)
        # click the query button if it is present
        if self.isElementExist('/html/body/div[1]/div[2]/div[2]/div[2]/div[2]/div[1]/form/div[3]'):
            self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[2]/div[2]/div[1]/form/div[3]').click()

    def get_data_simple(self, project):
        link = self.get_html(project)
        if link:
            self.driver.get(link)
            sleep(2)
            self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[2]/div[2]/div[1]/form/div[1]/div[1]').click()
            sleep(2)
            data = []
            # read the title at the top left of the page
            page_title = self.driver.find_element(By.XPATH, '/html/body/div[1]/div[2]/div[2]/div[2]/div[1]/div[1]/div[1]').text
            file_name = page_title.replace(' ', '_').replace('/', '_').replace('\\', '_') + '.xlsx'
            # output folder 爬取文件 under the current working directory
            item_folder_path = os.path.join(os.getcwd(), '爬取文件')

            # create the output folder if it does not exist
            if not os.path.exists(item_folder_path):
                os.makedirs(item_folder_path)
            print(f"Current project: {page_title}")

            self.set_time(project)
            while True:
                try:
                    # wait for the table element to load
                    table = WebDriverWait(self.driver, 10).until(
                        EC.presence_of_element_located((By.XPATH, "/html/body/div[1]/div[2]/div[2]/div[2]/div[2]/table"))
                    )
                    # locate all rows of the table
                    rows = table.find_elements(By.TAG_NAME, 'tr')
                    # extract the table data
                    for row in rows[1:-1]:  # skip the header row and the last row
                        cols = row.find_elements(By.TAG_NAME, 'td')
                        cols_data = [col.text for col in cols]
                        data.append(cols_data)
                    # get the next-page button
                    next_page = WebDriverWait(self.driver, 10).until(
                        EC.presence_of_element_located((By.XPATH, "//*[@id='page']/div/a[3]"))
                    )
                    # check whether the next-page button is disabled
                    if "layui-disabled" in next_page.get_attribute("class"):
                        break  # last page: leave the loop
                    # otherwise click it
                    self.driver.execute_script("arguments[0].click();", next_page)
                except TimeoutException:
                    print('No data for this project, or the dates are out of range; writing an empty file')
                    print("------------------------------------------")
                    # no next-page button within the timeout: treat as the last page
                    break

            # full path of the output file
            file_path = os.path.join(item_folder_path, file_name)
            # convert the data to a DataFrame
            df = pd.DataFrame(data, columns=['更新日期', '价格(元/吨)', '涨跌', '涨跌值', '备注'])
            # write to an Excel file
            df.to_excel(file_path, index=False, engine='openpyxl')
            print(f"Data written to {file_path}")
            print("------------------------------------------")
        else:
            print("Project not found. Project details:")
            print(project)
            # append the project to a separate xlsx file of projects that were not found
            self.save_project_info_to_excel(project)

    def save_project_info_to_excel(self, project):
        # file name and path
        file_name = '未找到项目信息.xlsx'
        item_folder_path = os.path.join(os.getcwd(), '爬取文件')
        if not os.path.exists(item_folder_path):
            os.makedirs(item_folder_path)
        file_path = os.path.join(item_folder_path, file_name)
        # start from an empty DataFrame if the file does not exist yet
        if not os.path.exists(file_path):
            df = pd.DataFrame(columns=['产品名称', '关键字'])
        else:
            # otherwise read the existing content
            df = pd.read_excel(file_path)
        # append the new project to the DataFrame
        new_project_info = pd.DataFrame([project])
        df = pd.concat([df, new_project_info], ignore_index=True)
        # write to the Excel file
        df.to_excel(file_path, index=False, engine='openpyxl')
        print(f"Project details appended to {file_path}")
        print("------------------------------------------")

    def get_data(self, projects):
        for project in projects:
            try:
                self.get_data_simple(project)
            except Exception as e:
                print(f"An error occurred while processing project {project}: {e}")
        print('Export finished')


if __name__ == '__main__':
    try:
        con = Concert()
        con.enter_concert()
        projects = con.read_key()
        con.get_data(projects)
    except Exception as e:
        print(e)

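Tip: once the program is running, the only manual steps are completing the login and typing the spreadsheet file name. The program does the scraping.

A word from the author:

Writing this took real effort, so a like, share, tip, or follow would be much appreciated.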
